Giulia Carella
SDSC20 Workshop - Integrating CARTOframes into Spatial Data Science workflows. October 21 2020
Site selection refers to the process of deciding where to open a new store/facility, or relocate an existing one, by comparing the merits of potential locations. If some attribute of interest is known (e.g. the revenues of existing stores), statistical and machine learning methods can be used to help with the decision process.
In this demo, we will look at where Starbucks should open new coffee shops in Long Island, NY. Specifically, we will build a model for the revenues of the existing stores as a function of sociodemographic data and use this model to predict the expected revenue in each block group.
First of all, install virtualenv inside your project folder to isolate the project's dependencies and avoid conflicts with your other packages.
pip install virtualenv
After installing it, run these commands one by one inside your project's root directory:
virtualenv venv
source venv/bin/activate
Your environment is now isolated and you can install the required packages inside it.
pip install -r requirements.txt
To exit the virtual environment type:
deactivate
from requirements import *
from ppca import PPCA
from utils import *
Go to your CARTO dashboard and click on API KEY to get your key.
carto_username = 'XXXXXX'
carto_API = 'XXXXXX'
set_default_credentials(username=carto_username, api_key=carto_API)
First we load the Starbucks data, which are stored locally in the starbucks_long_island.csv file. The data contain the addresses of Starbucks stores, their annual revenue ($), and some store characteristics (e.g. the number of open hours per week).
stores = pd.read_csv('./data/starbucks_long_island.csv')
stores.head(3)
| | name | addresslines | hours_open_per_week | gc_status_rel | has_lu | has_cl | has_wa | has_vs | zipcode | storenumber | revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7000 Austin Street | 7000 Austin Street,Forest Hills, NY 11375 | 108.0 | 0.97 | 1 | 1 | 1 | 1 | 11375 | 7217-623 | 1.016898e+06 |
| 1 | 107-12 Continental Avenue | 107-12 Continental Avenue,Forest Hills, NY 11375 | 116.5 | 0.95 | 1 | 0 | 1 | 0 | 11375 | 7349-910 | 4.470646e+05 |
| 2 | Stop & Shop-Forest Hills #539 | 8989 Union Turnpike,Forest Hills, NY 11375 | 112.0 | 0.97 | 0 | 0 | 1 | 0 | 11375 | 73708-104729 | 7.028071e+05 |
Next, we will plot the distribution of the annual revenues by store.
pandas_hist(stores, 'revenue',bins = 20, textsize = 15, title = 'Annual Revenue ($) by Starbucks store')
Since we only know the addresses, we first need to geocode them to extract the corresponding geographical locations, which will be used to assign to each store the Census data of the corresponding block group.
geocoder = Geocoding()
stores = geocoder.geocode(stores, street='addresslines').data
stores.crs = {'init' :'epsg:4326'}
stores.to_crs(epsg=4326, inplace=True)
Success! Data geocoded correctly
to_carto(stores,'starbucks_long_island_geocoded', if_exists = 'replace')
Success! Data uploaded to table "starbucks_long_island_geocoded" correctly
'starbucks_long_island_geocoded'
bk = [i*10**5 for i in [4, 7, 10, 13, 16]]
Map(
Layer('starbucks_long_island_geocoded',
style = size_bins_style('revenue',
breaks = bk,
size_range = [1, 30],
color = '#00eaff',
stroke_color = '#006bff'),
legends = size_bins_legend(title='Annual Revenue ($)',
description='STARBUCKS',
footer =''),
),show_info = True,
viewport={'zoom': 10, 'lat': 40.646891, 'lng': -73.869378},
size=(920,300)
)
Next, we will download from CARTO Data Observatory (www.carto.com/data) the demographic and socioeconomic variables that we will use to build a model for the stores' revenues. We will use data from the American Community Survey (ACS), which is an ongoing survey by the U.S. Census Bureau. The ACS is available at the most granular resolution at the census block group level, with the most recent geography boundaries available for 2015. The data that we will use are from the survey run from 2013 to 2017.
As we are interested only in the data for the Long Island area, we first load the geographical boundaries of this area from the manhattan_long_island.geojson file, which is stored locally.
aoi = gpd.read_file('./data/manhattan_long_island.geojson')
Map(
Layer(aoi,
style = basic_style(color = '#eaff00',
stroke_color = '#eaff00',
stroke_width = 2,
opacity = 0.2
),
),show_info = True,
viewport={'zoom': 7.5, 'lat': 40.845419, 'lng': -72.841376},
size=(920,300)
)
More documentation here: https://carto.com/developers/cartoframes/guides/Data-discovery/
Let's start by exploring the Data Catalog to see what data categories CARTO can offer for data in the US.
Catalog().country('usa').categories
[<Category.get('behavioral')>,
<Category.get('covid19')>,
<Category.get('demographics')>,
<Category.get('derived')>,
<Category.get('environmental')>,
<Category.get('financial')>,
<Category.get('housing')>,
<Category.get('human_mobility')>,
<Category.get('points_of_interest')>,
<Category.get('road_traffic')>]
Then check the available providers and geometries, and select the most recent ACS data at the block group level.
Catalog().country('usa').category('demographics').datasets.to_dataframe().provider_id.unique()
array(['ags', 'worldpop', 'usa_acs', 'usa_bls', 'mbi', 'gbr_cdrc',
'experian'], dtype=object)
catalog_usa_demo = Catalog().country('usa').category('demographics').datasets.to_dataframe()
catalog_usa_demo_acs = catalog_usa_demo[catalog_usa_demo.provider_id == 'usa_acs']
catalog_usa_demo_acs.geography_name.unique()
array(['Census Tract - United States of America (2015)',
'Core-based Statistical Area - United States of America (2019)',
'Public Use Microdata Area - United States of America (2015)',
'Census Place - United States of America (2015)',
'Congressional District - United States of America (2015)',
'State - United States of America (2015)',
'School District (secondary) - United States of America (2015)',
'School District (unified) - United States of America (2015)',
'County - United States of America (2015)',
'School District (elementary) - United States of America (2015)',
'5-digit Zip Code Tabulation Area - United States of America (2015)',
'Census Block Group - United States of America (2015)'],
dtype=object)
catalog_usa_demo_acs_bk = catalog_usa_demo_acs[catalog_usa_demo_acs.geography_name == 'Census Block Group - United States of America (2015)']
catalog_usa_demo_acs_bk.id.to_list()
['carto-do-public-data.usa_acs.demographics_sociodemographics_usa_blockgroup_2015_5yrs_20132017', 'carto-do-public-data.usa_acs.demographics_sociodemographics_usa_blockgroup_2015_5yrs_20122016', 'carto-do-public-data.usa_acs.demographics_sociodemographics_usa_blockgroup_2015_5yrs_20082012', 'carto-do-public-data.usa_acs.demographics_sociodemographics_usa_blockgroup_2015_5yrs_20072011', 'carto-do-public-data.usa_acs.demographics_sociodemographics_usa_blockgroup_2015_5yrs_20112015', 'carto-do-public-data.usa_acs.demographics_sociodemographics_usa_blockgroup_2015_5yrs_20102014', 'carto-do-public-data.usa_acs.demographics_sociodemographics_usa_blockgroup_2015_5yrs_20092013', 'carto-do-public-data.usa_acs.demographics_sociodemographics_usa_blockgroup_2015_5yrs_20062010']
And explore the metadata in greater detail.
Dataset.get('carto-do-public-data.usa_acs.demographics_sociodemographics_usa_blockgroup_2015_5yrs_20132017').to_dict()
{'slug': 'acs_sociodemogr_b758e778',
'name': 'Sociodemographics - United States of America (Census Block Group, 2017, 5yrs)',
'description': 'The American Community Survey (ACS) is an ongoing survey that provides vital information on a yearly basis about the USA and its people. This dataset contains only a subset of the variables that have been deemed most relevant. More info: https://www.census.gov/programs-surveys/acs/about.html',
'category_id': 'demographics',
'country_id': 'usa',
'data_source_id': 'sociodemographics',
'provider_id': 'usa_acs',
'geography_name': 'Census Block Group - United States of America (2015)',
'geography_description': 'Shoreline clipped TIGER/Line boundaries. More info: https://carto.com/blog/tiger-shoreline-clip/',
'temporal_aggregation': '5yrs',
'time_coverage': '[2013-01-01, 2018-01-01)',
'update_frequency': 'yearly',
'is_public_data': True,
'lang': 'eng',
'version': '20132017',
'category_name': 'Demographics',
'provider_name': 'American Community Survey',
'geography_id': 'carto-do-public-data.carto.geography_usa_blockgroup_2015',
'id': 'carto-do-public-data.usa_acs.demographics_sociodemographics_usa_blockgroup_2015_5yrs_20132017'}
Geography.get('carto-do-public-data.carto.geography_usa_blockgroup_2015').to_dict()
{'slug': 'cdb_blockgroup_7753dd51',
'name': 'Census Block Group - United States of America (2015)',
'description': 'Shoreline clipped TIGER/Line boundaries. More info: https://carto.com/blog/tiger-shoreline-clip/',
'country_id': 'usa',
'provider_id': 'carto',
'geom_type': 'MULTIPOLYGON',
'update_frequency': None,
'is_public_data': True,
'lang': 'eng',
'version': '2015',
'provider_name': 'CARTO',
'id': 'carto-do-public-data.carto.geography_usa_blockgroup_2015'}
Finally, get the data and geometries.
df_X = Dataset.get('acs_sociodemogr_b758e778')
q = "select * from $dataset$ where ST_Intersects(geom,st_geogfromtext('{}'))".format(aoi.geometry.astype(str)[0])
df_X = df_X.to_dataframe(sql_query=q)
df_X['geometry'] = df_X['geom'].apply(lambda x: str_to_geom(x))
df_X = GeoDataFrame(df_X,geometry = df_X.geometry)
df_X.crs = {'init' :'epsg:4326'}
df_X.to_crs(epsg=4326, inplace=True)
df_X.head()
| | geoid | do_date | total_pop | households | male_pop | female_pop | median_age | male_under_5 | male_5_to_9 | male_10_to_14 | ... | pop_in_labor_force | not_in_labor_force | armed_forces | civilian_labor_force | do_label | do_perimeter | do_area | do_num_vertices | geom | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 360470340002 | 2013-01-01 | 720.0 | 139.0 | 298.0 | 422.0 | 83.1 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 720.0 | 0.0 | 0.0 | Block Group 2 | 909.463 | 47648.525 | 9 | POLYGON((-73.994726 40.57152, -73.993577 40.57... | POLYGON ((-73.99473 40.57152, -73.99358 40.571... |
| 1 | 360471058042 | 2013-01-01 | 784.0 | 586.0 | 316.0 | 468.0 | 80.0 | 0.0 | 0.0 | 0.0 | ... | 14.0 | 770.0 | 0.0 | 14.0 | Block Group 2 | 1206.945 | 89274.575 | 9 | POLYGON((-73.889591 40.651338, -73.889994 40.6... | POLYGON ((-73.88959 40.65134, -73.88999 40.651... |
| 2 | 360810085002 | 2013-01-01 | 79.0 | 35.0 | 54.0 | 25.0 | 32.8 | 0.0 | 0.0 | 3.0 | ... | 52.0 | 13.0 | 0.0 | 52.0 | Block Group 2 | 1480.301 | 126149.190 | 10 | POLYGON((-73.94435 40.756875, -73.942822 40.75... | POLYGON ((-73.94435 40.75688, -73.94282 40.756... |
| 3 | 360811417005 | 2013-01-01 | 142.0 | 54.0 | 60.0 | 82.0 | 44.5 | 0.0 | 0.0 | 0.0 | ... | 64.0 | 56.0 | 0.0 | 64.0 | Block Group 5 | 2193.062 | 156898.112 | 17 | POLYGON((-73.797231 40.741696, -73.797268 40.7... | POLYGON ((-73.79723 40.74170, -73.79727 40.740... |
| 4 | 360470900001 | 2013-01-01 | 401.0 | 153.0 | 106.0 | 295.0 | 36.5 | 0.0 | 0.0 | 58.0 | ... | 136.0 | 134.0 | 0.0 | 136.0 | Block Group 1 | 1214.444 | 42970.556 | 7 | POLYGON((-73.918261 40.66857, -73.917221 40.66... | POLYGON ((-73.91826 40.66857, -73.91722 40.664... |
5 rows × 171 columns
Map(
Layer(df_X,
geom_col = 'geometry',
encode_data=False,
style = basic_style(color = '#eaff00',
stroke_width = 1,
opacity = 0.2
),
),show_info = True,
viewport={'zoom': 10, 'lat': 40.646891, 'lng': -73.869378},
size=(920,300)
)
When comparing data for irregular units like census block groups, extra care is needed for extensive variables, i.e. those whose value for a block can be viewed as a sum of sub-block values, as in the case of population. Extensive variables first need to be normalized, e.g. by dividing by the total area or the total population, depending on the application. Using the metadata available in CARTO Data Observatory, we will check which variables can be considered extensive by looking at their default aggregation method: if the aggregation method is classified as SUM, the variable is normalized by dividing by the total population.
## Get the default aggregation methods
agg_methods_table = read_carto("""
SELECT column_name,
agg_method
FROM variables_public
WHERE dataset_id = 'carto-do-public-data.usa_acs.demographics_sociodemographics_usa_blockgroup_2015_5yrs_20132017'""",
credentials = Credentials(username='do-metadata', api_key='default_public'))
agg_methods_table.dropna(inplace = True)
agg_methods_table.head()
| | column_name | agg_method |
|---|---|---|
| 0 | male_18_to_19 | SUM |
| 1 | associates_degree | SUM |
| 2 | income_100000_124999 | SUM |
| 3 | families_with_young_children | SUM |
| 4 | two_parents_not_in_labor_force_families_with_y... | SUM |
## Compute densities with respect to total population
for i in agg_methods_table.column_name:
    if((agg_methods_table[agg_methods_table.column_name==i].agg_method.iloc[0] == 'SUM') & (i!='total_pop')):
        df_X[i + '_dens'] = df_X[i].div(df_X['total_pop'])
        df_X.drop(i, axis = 1, inplace = True)
## Drop unused columns
drop_cols = df_X.loc[:, df_X.dtypes != np.float64].columns
drop_cols = [i for i in df_X.columns if 'do_' in i or (i in drop_cols and i not in ['geoid','geometry'])] + ['total_pop']
df_X.drop(drop_cols, axis = 1, inplace = True)
df_X_cols = list(df_X.drop(['geoid','geometry'], axis = 1).columns)
Missing data are common in surveys such as the ACS. Before using the ACS data as covariates, we therefore need to check whether any values are missing and decide how to impute them.
For each variable, the figure below shows the data points that are missing (corresponding to the rows in white). We notice that the amount of missing observations is typically small but depends on the variable.
fig, ax = plt.subplots(figsize=(60,10), nrows =1, ncols = 1)
msno.matrix(df_X[df_X_cols], labels = True, ax = ax)
<matplotlib.axes._subplots.AxesSubplot at 0x1c262576d8>
Next, we will check the pair-wise correlations between the ACS variables. Correlated covariates are problematic for inference in linear (and generalized linear) models, as they inflate the variance of the regression coefficients (https://en.wikipedia.org/wiki/Variance_inflation_factor), but they do not necessarily represent a problem for prediction. However, if correlated variables are present, we can take advantage of these correlations to reduce the dimensionality of the model and save computation time.
Looking at the pair-wise correlations of a random sample of 50 variables, we notice large correlations.
draw_corrm(df_X[random.sample(df_X_cols, 50)])
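Since the notebook's helpers only plot correlations, the multicollinearity mentioned above can also be quantified directly with the variance inflation factor. A minimal sketch using statsmodels (the `vif_table` helper is hypothetical, not part of the workshop utilities):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(df):
    """Variance inflation factor for each column of a numeric DataFrame.
    VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing column j on
    all the other columns; values well above ~5-10 flag strong collinearity."""
    # prepend an intercept column, as VIF assumes a model with a constant
    X = np.column_stack([np.ones(len(df)), df.to_numpy(dtype=float)])
    return pd.Series(
        [variance_inflation_factor(X, i + 1) for i in range(df.shape[1])],
        index=df.columns, name='VIF')
```

For example, `vif_table(df_X[df_X_cols].dropna())` would rank the ACS covariates by how redundant they are before the dimensionality reduction step.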
To impute the missing data and reduce the model dimensionality we will use a probabilistic formulation of Principal Component Analysis (PCA). PCA transforms a data set of high-dimensional vectors into a lower-dimensional linear representation such that the original data can be reconstructed from the compressed representation with minimum squared error.
In its probabilistic formulation, probabilistic PCA (PPCA), the complete data are described by a generative latent variable model whose parameters are estimated by iteratively alternating between computing the expected complete-data log-likelihood and updating the maximum likelihood estimates of the parameters. In PPCA, the data are incomplete because the principal components are not observed and are treated as latent variables, which also allows the method to be extended to the case where missing data are present.
More information can be found at http://www.jmlr.org/papers/volume11/ilin10a/ilin10a.pdf.
PCA can also be described as the maximum likelihood solution of a probabilistic latent variable model, which is known as PPCA:
\begin{equation*} Y_{ij} = \mathbf{P}_i \, \mathbf{Z}_j + m_i + \varepsilon_{ij} \quad i = 1, \dots, d; \; j = 1, \dots, n \end{equation*}

with

\begin{align} \mathbf{Z}_j \sim N(0, \mathbb{1}) \\ \varepsilon_{ij} \sim N(0, \nu) \end{align}

Both the principal components $\mathbf{Z}$ and the noise $\varepsilon$ are assumed normally distributed. The model can be identified by finding the maximum likelihood (ML) estimate of the model parameters using the Expectation-Maximization (EM) algorithm, which minimizes the mean-square error of the observed part of the data. EM is a general framework for learning parameters from incomplete data which iteratively alternates between computing the expected complete-data log-likelihood and updating the maximum likelihood estimates of the parameters. In PPCA, the data are incomplete because the principal components, $\mathbf{Z}_j$, are not observed and are treated as latent variables. When missing data are present, in the E-step the expectation of the complete-data log-likelihood is taken with respect to the conditional distribution of the latent variables given the observed variables. In this case, the EM update rules are the following.

E-step: \begin{align} \mathbf{\Sigma}_{\mathbf{Z}_j} = \nu \left(\nu \, \mathbb{1} + \sum_i \mathbf{P}_i \mathbf{P}_i^T \right)^{-1} \\ \overline{\mathbf{Z}}_j = \dfrac{1}{\nu}\mathbf{\Sigma}_{\mathbf{Z}_j} \sum_i \mathbf{P}_i \left(Y_{ij}- m_i \right) \\ m_{i} = \dfrac{1}{n} \sum_j \left(Y_{ij} - \mathbf{P}_i^T \, \overline{\mathbf{Z}}_j \right) \end{align}

M-step: \begin{align} \mathbf{P}_{i} = \left( \sum_j \left( \overline{\mathbf{Z}}_j \overline{\mathbf{Z}}_j ^T + \mathbf{\Sigma}_{\mathbf{Z}_j} \right) \right)^{-1} \sum_j \overline{\mathbf{Z}}_j \, \left(Y_{ij}- m_{i} \right)\\ \nu = \dfrac{1}{nd} \sum_{ij} \left[ \left(Y_{ij} - \mathbf{P}_i^T \, \overline{\mathbf{Z}}_j - m_i \right)^2 + \mathbf{P}_i^T \, \mathbf{\Sigma}_{\mathbf{Z}_j} \, \mathbf{P}_i \right] \end{align}

where, when data are missing, each row of $\mathbf{P}$ and each $\overline{\mathbf{Z}}_j$ is recomputed based only on the observed entries of the corresponding row or column of the data matrix.
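As a concrete illustration of the EM updates above, here is a minimal complete-data PPCA loop in plain numpy. This is only a sketch: the `run_ppca` helper used below wraps the `ppca` package and additionally handles the missing-data case, which this sketch omits. Names (`ppca_em`, `P`, `nu`) are illustrative.

```python
import numpy as np

def ppca_em(Y, k, n_iter=300, seed=0):
    """EM for probabilistic PCA, complete-data case.
    Y: (n, d) data matrix; k: number of principal components.
    Returns loadings P (d, k), mean m (d,), noise variance nu,
    and posterior mean scores Z (n, k)."""
    rng = np.random.default_rng(seed)
    n, d = Y.shape
    m = Y.mean(axis=0)            # ML mean in the complete-data case
    Yc = Y - m
    P = rng.normal(size=(d, k))
    nu = 1.0
    for _ in range(n_iter):
        # E-step: posterior of the latent scores given current P, nu
        M = P.T @ P + nu * np.eye(k)          # (k, k)
        Minv = np.linalg.inv(M)
        Z = Yc @ P @ Minv                     # posterior means Z_bar, (n, k)
        Ezz = n * nu * Minv + Z.T @ Z         # sum_j E[z_j z_j^T]
        # M-step: update loadings, then the noise variance
        P = Yc.T @ Z @ np.linalg.inv(Ezz)     # (d, k)
        nu = (np.sum(Yc**2) - np.trace(Z.T @ Yc @ P)) / (n * d)
    return P, m, nu, Z
```

With a well-specified `k` and a reasonable number of iterations, `nu` converges to the residual noise variance and `Z @ P.T` reconstructs the centered data up to that noise level.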
X, var_exp = run_ppca(df_X, df_X_cols, min_obs = 0.8)
X.head()
| | geoid | median_income | income_per_capita | renter_occupied_housing_units_paying_cash_median_gross_rent | owner_occupied_housing_units_lower_value_quartile | owner_occupied_housing_units_median_value | owner_occupied_housing_units_upper_value_quartile | median_year_structure_built | median_rent | percent_income_spent_on_rent | ... | pc_124 | pc_125 | pc_126 | pc_127 | pc_128 | pc_129 | pc_130 | pc_131 | pc_132 | pc_133 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 360470340002 | 11341.0 | 12284.0 | 247.0 | NaN | NaN | NaN | 1994.0 | 181.0 | 29.2 | ... | 0.052882 | 0.035244 | -0.272179 | 0.540095 | -0.499536 | 0.166496 | 0.691058 | 0.098954 | -0.018632 | -0.042992 |
| 1 | 360471058042 | 12646.0 | 11530.0 | 322.0 | NaN | NaN | NaN | 1989.0 | 301.0 | 29.3 | ... | -1.904357 | -0.149651 | -0.506206 | 0.635975 | 0.621000 | -0.024523 | -1.078909 | 0.049090 | -0.054530 | -0.537447 |
| 2 | 360810085002 | NaN | 11085.0 | 1631.0 | NaN | NaN | NaN | 1939.0 | 1620.0 | 50.0 | ... | 0.128417 | -0.208665 | -1.130804 | -0.263379 | 0.319421 | 0.191368 | 0.272441 | 0.642315 | 0.075760 | -0.099563 |
| 3 | 360811417005 | 45833.0 | 20522.0 | 1319.0 | NaN | NaN | NaN | 1969.0 | 1319.0 | 37.2 | ... | -0.117807 | 0.107891 | 0.448599 | -0.142086 | 0.125943 | 0.253386 | -0.219790 | 0.042805 | 0.242542 | 0.248561 |
| 4 | 360470900001 | 41417.0 | 19454.0 | 1175.0 | 354000.0 | 483300.0 | 623100.0 | 1965.0 | 1175.0 | 50.0 | ... | 0.102713 | 0.205192 | 0.258857 | 0.183891 | 0.318649 | 0.259706 | -0.200628 | -0.599910 | -0.332049 | -0.094779 |
5 rows × 298 columns
First, we compute the explained variance as a function of the number of PC scores retained, to decide how many PCs we should keep. We decide to keep the PCs that together explain up to 80% of the variance (the black dashed line in the plot below), but this choice is somewhat arbitrary and might vary depending on the application.
plot_pc_var(var_exp, textsize = 15, title = 'PPCA explained variance ratio')
To understand the relationship between the transformed variables (the PC scores) and the original variables, we can plot the 10 variables most highly correlated with each PC, as shown in the bar plot below, which gives the correlation for each of these variables. For example, we see that the first PC, which is the one that explains most of the variance, is positively correlated with the density of owner-occupied housing units but negatively correlated with the density of renter-occupied housing units, as the map below also shows.
Note: depending on the random seed on your computer, the signs of the correlations might change.
plot_pc_corr(X, df_X_cols)
Map(
Layer(X,
encode_data=False,
style = color_bins_style('pc_0',
palette = 'SunsetDark',
stroke_width = 1,
opacity = 0.5),
legends = color_bins_legend(title='First Principal Component Score (PC0)',
description='',
footer ='ACS'),
popup_hover=[popup_element('pc_0', title='First Principal Component Score (PC0)')]
),viewport={'zoom': 10, 'lat': 40.646891, 'lng': -73.869378})
cum_var_exp = np.cumsum(var_exp)
ncomponents = np.where(cum_var_exp > 0.8)[0][0]
X = X[['geoid','geometry'] + ['pc_' + str(j) for j in range(ncomponents)]]
stores_enriched = gpd.sjoin(stores[['revenue','the_geom']], X, how='right', op='intersects').drop('index_left', axis = 1)
stores_enriched.head()
| | revenue | geoid | geometry | pc_0 | pc_1 | pc_2 | pc_3 | pc_4 | pc_5 | pc_6 | ... | pc_57 | pc_58 | pc_59 | pc_60 | pc_61 | pc_62 | pc_63 | pc_64 | pc_65 | pc_66 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 360470340002 | POLYGON ((-73.99473 40.57152, -73.99358 40.571... | -0.695606 | -7.093879 | 14.704991 | 8.243893 | -2.187761 | -7.241100 | -2.413142 | ... | -0.268878 | 2.092414 | -0.486674 | 1.294718 | 0.997972 | 0.742453 | 0.638661 | -2.005902 | -0.971642 | -0.082802 |
| 1 | NaN | 360471058042 | POLYGON ((-73.88959 40.65134, -73.88999 40.651... | -7.305653 | -0.174047 | 23.960472 | 14.237168 | -0.238021 | -2.775248 | -1.813125 | ... | -3.376382 | -2.490287 | 1.216573 | -2.858679 | -3.792763 | 3.535218 | -1.904688 | -2.395826 | -1.354692 | 2.334104 |
| 2 | NaN | 360810085002 | POLYGON ((-73.94435 40.75688, -73.94282 40.756... | -7.318997 | 2.315952 | -1.726563 | 0.096251 | 2.881854 | -5.502071 | -1.158724 | ... | 0.222345 | -1.493035 | 0.501758 | 2.744042 | 0.328252 | 0.334574 | -0.817075 | 1.506422 | -0.050899 | 2.149819 |
| 3 | NaN | 360811417005 | POLYGON ((-73.79723 40.74170, -73.79727 40.740... | -3.066791 | -2.135412 | 4.192196 | 3.922212 | 0.556181 | 0.387589 | -4.297234 | ... | -4.836646 | 0.745295 | 0.416194 | 2.262863 | 0.037322 | 0.635615 | 1.237348 | -1.546747 | 1.471088 | -2.790024 |
| 4 | NaN | 360470900001 | POLYGON ((-73.91826 40.66857, -73.91722 40.664... | -4.270007 | -4.786713 | 2.856504 | -0.202370 | -3.069812 | 4.154868 | 5.214297 | ... | 1.810704 | 3.607572 | 1.436051 | 0.608898 | -0.105599 | -0.349638 | -2.234146 | -1.385064 | -1.613662 | -1.255643 |
5 rows × 70 columns
Having prepared the model covariates, we will now model the annual revenues by store ($\mathbf{y}$) using a Generalized Linear Model (GLM) with the selected PC scores as covariates ($\mathbf{X}$).
\begin{equation*} E\big[\mathbf{y}\vert\mathbf{X}\big] = g^{-1}\Big(\sum_j X_j \, \beta_j \Big) \end{equation*}

where $E\big[\mathbf{y}\vert\mathbf{X}\big]$ is the expected value of the observations $\mathbf{y}$ conditional on the covariates $\mathbf{X}$ and $g$ is a link function. Given the distribution of the annual revenues, we will use a Gamma likelihood with a logarithmic link function.
df1 = stores_enriched[~np.isnan(stores_enriched['revenue'])]
lin_scale = 10**5
pc_cols = [i for i in df1.columns if 'pc_' in i]
y = df1['revenue'].values/lin_scale
X = df1[pc_cols]
X0 = np.ones((X.shape[0],1))
X = np.hstack((X,X0))
sm_fit = sm.GLM(y,X, family=sm.families.Gamma(link=sm.families.links.log)).fit()
df1['y_hat'] = sm_fit.predict(X)*lin_scale
Looking at the model results, we notice a significant positive correlation with the first PC (which is higher in richer areas) and significant negative correlations with the second and third PCs, which are positively correlated with workforce density and with the density of female seniors, respectively.
Note: the individual correlations with the PC and therefore with the original variables, might depend on the random seed on your computer but will be consistent overall.
print(sm_fit.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 177
Model: GLM Df Residuals: 109
Model Family: Gamma Df Model: 67
Link Function: log Scale: 0.034290
Method: IRLS Log-Likelihood: -349.01
Date: Wed, 21 Oct 2020 Deviance: 3.7283
Time: 17:28:21 Pearson chi2: 3.74
No. Iterations: 20 Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
x1 0.1808 0.006 28.202 0.000 0.168 0.193
x2 -0.0498 0.007 -7.428 0.000 -0.063 -0.037
x3 -0.0648 0.007 -9.193 0.000 -0.079 -0.051
x4 -0.0066 0.010 -0.634 0.526 -0.027 0.014
x5 0.0454 0.012 3.788 0.000 0.022 0.069
x6 -0.0153 0.013 -1.188 0.235 -0.041 0.010
x7 0.0119 0.020 0.599 0.549 -0.027 0.051
x8 0.0045 0.028 0.158 0.875 -0.051 0.060
x9 0.0253 0.017 1.459 0.145 -0.009 0.059
x10 -0.0059 0.016 -0.362 0.718 -0.038 0.026
x11 0.0081 0.017 0.476 0.634 -0.025 0.041
x12 -0.0122 0.019 -0.640 0.522 -0.050 0.025
x13 0.0138 0.018 0.768 0.443 -0.021 0.049
x14 -0.0100 0.019 -0.522 0.602 -0.047 0.027
x15 -0.0035 0.033 -0.106 0.915 -0.068 0.061
x16 -0.0864 0.031 -2.805 0.005 -0.147 -0.026
x17 -0.0121 0.023 -0.527 0.598 -0.057 0.033
x18 0.0146 0.020 0.737 0.461 -0.024 0.053
x19 -0.0076 0.020 -0.376 0.707 -0.047 0.032
x20 0.0243 0.018 1.372 0.170 -0.010 0.059
x21 -0.0091 0.022 -0.408 0.683 -0.053 0.035
x22 0.0148 0.019 0.769 0.442 -0.023 0.052
x23 0.0437 0.022 1.964 0.049 9.67e-05 0.087
x24 -0.0080 0.018 -0.443 0.658 -0.043 0.027
x25 0.0036 0.020 0.182 0.855 -0.035 0.042
x26 0.0248 0.020 1.263 0.207 -0.014 0.063
x27 0.0113 0.021 0.535 0.593 -0.030 0.053
x28 -0.0368 0.021 -1.737 0.082 -0.078 0.005
x29 0.0101 0.019 0.531 0.595 -0.027 0.047
x30 -0.0525 0.022 -2.380 0.017 -0.096 -0.009
x31 -0.0332 0.022 -1.539 0.124 -0.075 0.009
x32 -0.0365 0.020 -1.838 0.066 -0.075 0.002
x33 0.0442 0.018 2.402 0.016 0.008 0.080
x34 0.0051 0.021 0.239 0.811 -0.036 0.046
x35 -0.0296 0.020 -1.480 0.139 -0.069 0.010
x36 -0.0191 0.024 -0.778 0.437 -0.067 0.029
x37 0.0143 0.022 0.652 0.514 -0.029 0.057
x38 0.0048 0.020 0.237 0.812 -0.035 0.044
x39 -0.0148 0.021 -0.693 0.488 -0.057 0.027
x40 -0.0293 0.022 -1.308 0.191 -0.073 0.015
x41 0.0333 0.023 1.458 0.145 -0.011 0.078
x42 -0.0219 0.025 -0.872 0.383 -0.071 0.027
x43 -0.0026 0.024 -0.107 0.915 -0.050 0.044
x44 0.0158 0.021 0.750 0.453 -0.026 0.057
x45 0.0242 0.024 0.999 0.318 -0.023 0.072
x46 0.0030 0.024 0.127 0.899 -0.044 0.050
x47 -0.0323 0.022 -1.448 0.148 -0.076 0.011
x48 -0.0033 0.021 -0.153 0.878 -0.045 0.039
x49 -0.0200 0.025 -0.793 0.428 -0.070 0.029
x50 0.0107 0.020 0.537 0.591 -0.028 0.050
x51 -0.0408 0.023 -1.785 0.074 -0.086 0.004
x52 -0.0066 0.023 -0.290 0.772 -0.051 0.038
x53 0.0064 0.024 0.268 0.788 -0.040 0.053
x54 -0.0037 0.022 -0.168 0.866 -0.047 0.039
x55 0.0057 0.022 0.265 0.791 -0.036 0.048
x56 0.0146 0.020 0.725 0.469 -0.025 0.054
x57 -0.0385 0.022 -1.713 0.087 -0.083 0.006
x58 -0.0119 0.023 -0.510 0.610 -0.058 0.034
x59 -0.0114 0.022 -0.515 0.606 -0.055 0.032
x60 0.0381 0.021 1.811 0.070 -0.003 0.079
x61 -0.0290 0.025 -1.150 0.250 -0.078 0.020
x62 0.0060 0.028 0.213 0.831 -0.049 0.061
x63 -0.0227 0.020 -1.139 0.255 -0.062 0.016
x64 -0.0037 0.024 -0.154 0.878 -0.050 0.043
x65 0.0343 0.024 1.430 0.153 -0.013 0.081
x66 0.0100 0.022 0.458 0.647 -0.033 0.053
x67 -0.0056 0.023 -0.243 0.808 -0.050 0.039
const 2.2616 0.025 89.760 0.000 2.212 2.311
==============================================================================
To assess the model accuracy, we then plot the observed vs. the predicted values and compute the pseudo-$R^2$ score, which is defined as

\begin{equation*} 1 - \dfrac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \end{equation*}

As the plot below shows, most of the data points lie on the 1:1 line; together with the high value of the pseudo-$R^2$ score, this indicates that our model is accurate in its in-sample predictions.
pandas_scatter(df1,'revenue','y_hat', xlab = 'Observed', ylab = 'Predicted', pseudo_R2 = True)
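The `pseudo_r2score` helper used below is assumed to implement exactly the formula above; as a self-contained sketch:

```python
import numpy as np

def pseudo_r2(y, y_hat):
    """Pseudo-R^2 = 1 - SS_res / SS_tot: the share of the variance of the
    observations that is explained by the predictions."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)          # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)       # total sum of squares
    return 1.0 - ss_res / ss_tot
```

A perfect prediction gives 1, predicting the mean of the observations gives 0, and a fit worse than the mean gives a negative score.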
While GLMs typically do not overfit, we can also compare the pseudo-$R^2$ scores on the training and test sets.
X_train, X_test, y_train, y_test, idx_train, idx_test \
= train_test_split(X, y, np.arange(X.shape[0]), test_size=0.2, random_state=42)
fitglm = sm.GLM(y_train,X_train, family=sm.families.Gamma(link=sm.families.links.log)).fit()
df1_train = df1.iloc[idx_train]
df1_test = df1.iloc[idx_test]
print('Training pseudo-r2 score: ' + str(pseudo_r2score(y_train,fitglm.predict(X_train))))
Training pseudo-r2 score: 0.94
print('Testing pseudo-r2 score: ' + str(pseudo_r2score(y_test,fitglm.predict(X_test))))
Testing pseudo-r2 score: 0.81
Finally, we can use this model to predict the annual revenue ($) for each block group in Long Island.
df2 = stores_enriched.copy()
y = df2['revenue'].values/lin_scale
X = df2[pc_cols]
X0 = np.ones((X.shape[0],1))
X = np.hstack((X,X0))
df2['revenue_pred'] = sm_fit.predict(X)*lin_scale
With CARTOframes you can then share your visualizations by publishing your map to a URL.
to_carto(df2,'starbucks_long_island_predictions', if_exists = 'replace')
Success! Data uploaded to table "starbucks_long_island_predictions" correctly
'starbucks_long_island_predictions'
bk = [i*10**5 for i in [4, 7, 10, 13, 16, 20, 30, 40]]
pubmap = Map(
[Layer('starbucks_long_island_predictions',
style = color_bins_style('revenue_pred',
breaks = bk,
palette = 'SunsetDark',
stroke_width = 1,
opacity = 0.5),
legends = color_bins_legend(title='Annual Predicted Revenue ($)',
description='',
footer =''),
popup_hover=[popup_element('revenue_pred', title='Annual Predicted Revenue ($)')],
widgets=[histogram_widget('revenue_pred',
title='Annual Predicted Revenue ($)',
description='Select a range of values to filter',
buckets=15)]),
Layer('starbucks_long_island_geocoded',
style = size_bins_style('revenue',
breaks = bk,
size_range = [1, 30],
color = '#00eaff',
stroke_color = '#006bff'),
legends = size_bins_legend(title='Annual Revenue ($)',
description='STARBUCKS',
footer ='2019'))
],show_info = True,
viewport={'zoom': 9.4, 'lat': 40.673790, 'lng': -73.992685},
size=(920,300)
)
pubmap
pubmap.publish("starbucks_long_island",
password=None,
if_exists='replace')
The map has been published. The "cartoframes_12026fa313a1e0f2b864adbac15af74b" Maps API key with value "Bl4uIU5IhE6uKroAcolCAg" is being used for the datasets: "starbucks_long_island_geocoded", "starbucks_long_island_predictions". You can manage your API keys on your account.
{'id': '155344a5-b53f-43d9-a50f-6e7ed5bda173',
'url': 'https://sc-giulia.carto.com/u/sc-giulia-admin/kuviz/155344a5-b53f-43d9-a50f-6e7ed5bda173',
'name': 'starbucks_long_island',
'privacy': 'public'}
Having assessed the performance of the non-spatial GLM, we can check whether the model residuals $y_i-\hat{y}_i$ show any residual spatial dependence, which would indicate that better results might be obtained with a model that explicitly accounts for the fact that an observation is more correlated with observations at neighboring locations than with observations farther away.
A spatially-structured random effect is beneficial when the covariates in the model are not sufficient to explain the variability of the observations, so that the measurements, given the predictors, are not independent. Including such a random effect in the model helps mitigate this situation.
A spatial stochastic process can be decomposed as:
\begin{equation*} y(\mathbf{s}) = \mu(\mathbf{s}) + w(\mathbf{s}) + \varepsilon \end{equation*}where $\mu(\mathbf{s})$ is the mean function (deterministic and smooth), $\varepsilon$ is an independent process and
\begin{equation*} w(\mathbf{s}) \sim GP\left(0, C(\mathbf{s}, \mathbf{s'}, \mathbf{\theta})\right) \end{equation*}where the covariance $C(\mathbf{s}, \mathbf{s'}, \mathbf{\theta})$ typically depends only on the Euclidean length of the spatial separation vector $|\mathbf{h}| = |\mathbf{s} - \mathbf{s'}|$, and potentially on some hyperparameters $\mathbf{\theta}$. For example, the exponential covariance function is defined as:
\begin{equation*} C\left(|\mathbf{h}|\right) = \sigma^2 \exp\left(-\kappa |\mathbf{h}|\right) \end{equation*}where the empirically derived definition for the range is $\rho = \dfrac{\sqrt{8 \nu}}{\kappa}$ (with smoothness $\nu = 1/2$ in the exponential case), and the covariance between two points tends to zero as the points become farther separated in space.
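As a quick sketch of how such a covariance behaves, the exponential kernel can be evaluated over a set of points; the parameter values below are illustrative, not those estimated later in the notebook:

```python
import numpy as np

def exponential_cov(coords, sigma2=1.0, kappa=1.0):
    """Exponential covariance C(|h|) = sigma2 * exp(-kappa * |h|)
    evaluated for every pair of points (illustrative parameter values)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return sigma2 * np.exp(-kappa * d)

coords = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 5.0]])
C = exponential_cov(coords)
# the covariance decays with distance: C[0, 1] > C[0, 2]
```

The resulting matrix is symmetric with the sill $\sigma^2$ on the diagonal, and entries shrink toward zero as the pairwise distance grows.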
Fitting spatial models comes with some caveats:
Variograms are a useful tool to check for spatial dependence in the residuals. The variogram is defined as:
\begin{equation*} 2\gamma(\mathbf{h}) := Var\left[y(\mathbf{s + h}) - y(\mathbf{s})\right] = 2 \left(C \left(0\right) - C \left(|\mathbf{h}|\right)\right)\end{equation*}First, we will construct an empirical variogram by grouping the pairs of observations into bins based on their distance (assuming isotropy) and averaging the squared differences of the values over all pairs in each bin. Given a semivariogram model, we can then fit this model to the empirical semivariogram and estimate the model parameters.
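Before relying on a library, it can help to see the binning-and-averaging step spelled out. A minimal sketch of the Matheron estimator in NumPy (the function name and signature are illustrative, not part of scikit-gstat):

```python
import numpy as np

def empirical_variogram(coords, values, n_lags=10, maxlag=None):
    """Matheron estimator: half the average squared difference of
    value pairs, binned by the distance between their locations."""
    n = len(values)
    i, j = np.triu_indices(n, k=1)              # all unordered pairs
    d = np.linalg.norm(coords[i] - coords[j], axis=1)
    sq = (values[i] - values[j]) ** 2
    if maxlag is None:
        maxlag = d.max()
    edges = np.linspace(0.0, maxlag, n_lags + 1)
    gamma = np.full(n_lags, np.nan)
    for k in range(n_lags):
        mask = (d >= edges[k]) & (d < edges[k + 1])
        if mask.any():
            gamma[k] = 0.5 * sq[mask].mean()    # semivariance per bin
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, gamma

rng = np.random.default_rng(1)
coords = rng.uniform(size=(30, 2))
values = rng.normal(size=30)
centers, gamma = empirical_variogram(coords, values, n_lags=8)
```

The `Variogram` class used below does the same binning internally and then fits a parametric model to the resulting points.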
coords_x = df1['geometry'].apply(lambda x: x.centroid.x)
coords_y = df1['geometry'].apply(lambda x: x.centroid.y)
coordinates = pd.concat([coords_x, coords_y], axis=1).values
values = (df1['y_hat']-df1['revenue'])/lin_scale
vg = Variogram(coordinates=coordinates,
values=values,
normalize=False,
use_nugget=True,
model='exponential',
maxlag="mean",
n_lags=15)
ax = Variogram_plot(vg, fig_title='Residual variogram')
The fitted variogram model parameters are:
print(vg)
exponential Variogram
---------------------
Estimator: matheron
Effective Range: 0.37
Sill: 4.70
Nugget: 0.99
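With these parameters, the fitted semivariogram curve can be evaluated at any lag. A sketch assuming the scikit-gstat convention for the exponential model, where the shape parameter is $a = r/3$ with $r$ the effective range ($\gamma(h) = b + c\,(1 - e^{-h/a})$, with sill $c$ and nugget $b$):

```python
import numpy as np

def exp_variogram(h, r=0.37, c=4.70, b=0.99):
    """Exponential semivariogram with effective range r, sill c and
    nugget b (values taken from the fit above; a = r/3 is assumed)."""
    a = r / 3.0
    return b + c * (1.0 - np.exp(-h / a))
```

At $h = 0$ the curve equals the nugget, and at the effective range it has climbed to roughly 95% of the way from nugget to total sill.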
We will now extend the GLM model to account for the residual spatial dependence by fitting this model:
\begin{equation*} E\big[y_i\vert\mathbf{X}\big] = g^{-1}\,\Big(\sum_j \; X_{ij}\, \beta_j + u(\mathbf{s}_i) + \varepsilon_i \Big) \end{equation*}where $u(\mathbf{s}_i)$ represents the spatially structured residuals, modelled as a Gaussian Process with an exponential covariance function, and $\varepsilon_i$ is an IID random term.
We will fit this model in a Bayesian framework using the programming language Stan (https://mc-stan.org). However, to reduce the model complexity we will fix the covariance parameters and use those derived from the variogram analysis (in a fully Bayesian approach these would be estimated).
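Conceptually, once the covariance parameters are fixed, the spatially-structured effect the Stan model works with is a draw from a multivariate normal whose covariance is the exponential kernel over the store locations, plus a nugget on the diagonal. A minimal NumPy sketch of that construction (coordinates and parameter values here are simulated, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# n illustrative locations in the unit square
n = 50
coords = rng.uniform(0.0, 1.0, size=(n, 2))

# pairwise Euclidean distances
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

# exponential covariance with fixed (illustrative) parameters:
# partial sill sigma2, range rho, nugget tau2 on the diagonal
sigma2, rho, tau2 = 1.0, 0.3, 0.1
C = sigma2 * np.exp(-d / rho) + tau2 * np.eye(n)

# one draw of the spatially structured random effect u(s)
L = np.linalg.cholesky(C)
u = L @ rng.standard_normal(n)
```

In the actual Stan run this term enters the linear predictor alongside the fixed effects, and the sampler explores its posterior rather than drawing a single realization.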
The Stan model is saved locally in ./stan/GP_gamma_model_exact.stan.
y = stores_enriched['revenue'].values/lin_scale
X = stores_enriched[pc_cols]
X0 = np.ones((X.shape[0],1))
X = np.hstack((X,X0))
coords_x = stores_enriched['geometry'].apply(lambda x: x.centroid.x)
coords_y = stores_enriched['geometry'].apply(lambda x: x.centroid.y)
coords = np.vstack((coords_x.values,coords_y.values)).T
y1 = y[~np.isnan(y)]
X1 = X[~np.isnan(y)]
coords1_x = df1['geometry'].apply(lambda x: x.centroid.x)
coords1_y = df1['geometry'].apply(lambda x: x.centroid.y)
coords1 = np.vstack((coords1_x.values,coords1_y.values)).T
rho, sigma, tau = vg.parameters
## Define the data dictionary for the model
## (the 'pho' key matches the parameter name used in the Stan model's data block)
data = {'N': X1.shape[0], 'K': X1.shape[1],
        'X': X1, 'y': y1, 'coords': coords1,
        'pho': rho, 'sigma': sigma, 'tau': tau}
## Compile the model
#stanm = pystan.StanModel(file='./stan/GP_gamma_model_exact.stan')
## Train the model and generate samples
#stan_fit = stanm.sampling(data=data, iter=1000, chains=3,
# warmup=500, thin=1, seed=101)
## Save the model
#with gzip.open("./stan/GP_gamma_model_exact.gz", "wb") as f:
# pickle.dump({'model' : stanm, 'fit' : stan_fit}, f, protocol=-1)
## Load the model
with gzip.open("./stan/GP_gamma_model_exact.gz", "rb") as f:
data_dict = pickle.load(f)
stan_fit = data_dict['fit']
Extract the posterior mean and standard deviations
df1['y_hat_m'] = stan_fit.extract()['y_rep'].mean(axis=0)*lin_scale
df1['y_hat_se'] = stan_fit.extract()['y_rep'].std(axis=0)*lin_scale
pandas_scatter(df1,'revenue','y_hat_m', xlab = 'Observed', ylab = 'Predicted', pseudo_R2 = False)
_ = plt.plot([df1.revenue,df1.revenue],
[df1['y_hat_m']-df1['y_hat_se'], df1['y_hat_m']+df1['y_hat_se']], 'b-', linewidth = 0.3,linestyle = '--')
plt.xlim((df1['y_hat_m']-df1['y_hat_se']).min(),(df1['y_hat_m']+df1['y_hat_se']).max())
(86054.29278814871, 7641334.01044289)
We now fit the variogram model to the residuals of this model, using the posterior mean.
Compared to the previous variogram plot, adding a spatially-structured random effect has substantially reduced the spatial dependence in the residuals (note the scale of the y-axis).
## Fit variogram model
values = (df1['y_hat_m']-df1['revenue'])/lin_scale
coordinates = coords1
vg = Variogram(coordinates=coordinates,
values=values,
normalize=False,
use_nugget=True,
model='exponential',
maxlag="mean",
n_lags=15)
ax = Variogram_plot(vg, fig_title='Residual variogram')
The model above has clearly removed the spatial dependence in the residuals. The next step would be to use this model to predict the annual revenues for each census block group. However, because the computation time for exact GPs scales cubically with the number of data locations, this requires approximations that go beyond the scope of this workshop. More information can be found here:
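To see the cubic scaling concretely: the dominant cost of exact GP inference is factorizing the dense $n \times n$ covariance matrix, and Cholesky factorization is $O(n^3)$ in time and $O(n^2)$ in memory. A rough, machine-dependent illustration using a stand-in SPD matrix (not the actual model covariance); doubling $n$ multiplies the runtime by roughly eight:

```python
import numpy as np
import time

rng = np.random.default_rng(0)
for n in (250, 500, 1000):
    A = rng.standard_normal((n, n))
    C = A @ A.T + n * np.eye(n)        # symmetric positive definite stand-in
    t0 = time.perf_counter()
    L = np.linalg.cholesky(C)          # O(n^3) factorization
    print(f"n={n}: {time.perf_counter() - t0:.4f} s")
```

This is why block-group-level prediction calls for approximations such as low-rank or sparse GP methods rather than the exact model used above.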